Benchmarking of microbiome detection tools on RNA-seq synthetic databases according to diverse conditions

Abstract Motivation Here, we performed a benchmarking analysis of five tools for microbe sequence detection using transcriptomics data (Kraken2, MetaPhlAn2, PathSeq, DRAC and Pandora). We built a synthetic database mimicking real-world structure with tuned conditions accounting for microbe species prevalence, base calling quality and sequence length. Sensitivity and positive predictive value (PPV) parameters, as well as computational requirements, were used for tool ranking. Results GATK PathSeq showed the highest sensitivity on average and across all scenarios considered. However, the main drawback of this tool was its slowness. Kraken2 was the fastest tool and displayed the second-best sensitivity, though with large variance depending on the species to be classified. There was no significant difference for the other three algorithms sensitivity. The sensitivity of MetaPhlAn2 and Pandora was affected by sequence number and DRAC by sequence quality and length. Results from this study support the use of Kraken2 for routine microbiome profiling based on its competitive sensitivity and runtime performance. Nonetheless, we strongly endorse to complement it by combining with MetaPhlAn2 for thorough taxonomic analyses. Availability and implementation https://github.com/fjuradorueda/MIME/ and https://github.com/lola4/DRAC/. Supplementary information Supplementary data are available at Bioinformatics Advances online.


Supplementary File II: DRAC (Discarded Reads Alignment and Coverage) pipeline
DRAC is a pipeline that takes a sequencing file as input and scans for bacterial sequences in it. DRAC shares the same conceptual framework than other similar pipelines: RNA CoMPASS (Xu et al 2014), PHAT (Pathogen-Host Analysis Tool, Gibb et al. 2019), PAIPline (Andrusch et al., 2018 and SLIMM (Dadi et al., 2017) among others. The dual study of pathogen and host genome/transcriptome from NGS files is an application that has gained interest in the scientific community, because two feature layers can be extracted in parallel. For instance, from a unique RNA-Seq file one can study the host transcriptome and the discovery of pathogens as well (see Xu et al 2014). Pathogen-host NGS analysis allows detection of microbes without prior hypothesis about possible causative/contaminant agents.
The workflow DRAC follows involves several consecutive steps: sequence alignment to the host genome, read cleaning and filtering, mapping against a database of interest (RefSeq representative genomes), and on top of that, DRAC computes the coverage along every reference genome. That said, DRAC is not a metagenomics classifier but, instead, a pipeline that was developed to scan NGS files for putatively present microbiota or contaminant microbes.
DRAC is conceived as a post-processing tool to be run on a NGS file (either fastq or SAM/BAM format) of previously discarded reads that could not be aligned to the host genome (it may be human or any other host). It is mostly devised to be run on a single organism tissue RNA-Seq, but also Whole Exome Sequencing or Whole Genome Sequencing are supported, rather than a high-biomass metagenomic design (i.e., it is not recommended for high biomass studies, e.g., feces or saliva). This is because the program that DRAC uses for identifying the microbes is the Burrows-Wheeler Aligner (BWA) that it is not designed for that sort of task, besides of being slow due to the alignment process. However, BWA its very appropriate to align a relatively small subset of Illumina single/paired-end sequences from previously aligned reads to the human/host reference genome, against the complete bacterial genomes in the RefSeq database, as was performed in a seminal paper that identified bacteria that may be present in human tumor samples (Robinson et al 2017, DOI: 10.1186 Since with BWA identifications one cannot conclude whether the reads are true or false positives, DRAC pipeline subsequently applies a method to ensure that counts belonging to a particular genome are well spread and covering a directly proportional part of it, as a proof that they are more reliable. DRAC is a binner tool that takes a BAM file as input (with a minor change it can start from a non-host reads fastq file) keeping both the host aligned reads and the unmapped (discarded by the aligner, for whatever reason) that DRAC classifies by the bitwise flag (samtools). Afterwards, DRAC trims the still remaining adapters, the low complexity sequences with the DUST algorithm, and the low quality tails, and dismisses the short reads after this process, saving the clean ones in a fastq file.
Since the remaining reads are unsorted and one of the pairs members is sometimes lost, the pipeline fixes the pairs to keep only the paired reads. It also saves the singletons in case the user needs to analyze them in a further step.
BWA aligns the unmapped but sorted paired-end reads against a reference database. In the article that we present here, a database was previously built (BWA-indexed) with the bacterial genomes downloaded from RefSeq and, in particular, the set of Representative Genomes (one assembly per defined species selected as representative, with criteria explained in https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#representative_genomes).
Other similar pipelines (e.g., SLIMM performs a species genome selection when multiple genomes are available).
After aligning the reads, the program extracts the counts per genome and joins them in a sample counts file. The pipeline can also join the counts of several samples in one single table. Afterwards, DRAC uses bedtools coverageBed for computing the coverage along the genomes represented in the sequence alignments of the resulting BAM file. The computed coverages are joined in a sample file (again, individual sample or merged when there is a group of samples in the analysis).
A unique aspect of DRAC pipeline is that it takes into account the coverage of a particular bacterial genome and evaluates the expected vs. the observed coverage. DRAC assumes a constant read length (the average length, also calculated by the method), and it also assumes that reads should be widespread or scattered along the genome for them to be considered reliable, similar to KrakenUniq and SLIMM, discarding unlikely genomes based on coverage landscape. This way, those methods eliminate genomes that have a stack of reads only in few places across their genomes, which could be a result of sequencing artifacts, spike-ins, or a conserved region among distant relatives.
By knowing how long the reads are and how much they should cover if they are fairly spread (Expected Coverage = read length x number of reads), we can calculate an internal score to discard those reads with low coverage (according to what it would be expected by the abovementioned formula). By doing so, DRAC penalizes bacterial assignments with very poor coverage and dismisses them, while it will retain the most reliable (scattered through the genome) ones, that most probably are true positives.
The pipeline also performs some calculations, proportions, and creates a global table in case of scanning several samples. In the last stages, DRAC performs the count normalization and statistical analysis.
DRAC scripts are written in Bash, but it also has pieces of code in Python, and R languages, and the project can be found on the GitHub repository (https://github.com/lola4/DRAC), as well as documentation and explanatory diagrams ( Supplementary Figures S9 and S10).   Figure S1: GATK PathSeq sensitivity values a cross the 4-changing conditions scenarios. In the plot, the median is represented by the line whereas the mean is represented by the diamond. Wilcoxon rank-sum test was applied across scenarios.

References cited in
Supplementary Figure S2: DRAC sensitivity values across the 4-changing conditions scenarios. In the plot, the median is represented by the line whereas the mean is represented by the diamond. Wilcoxon rank-sum test was applied across scenarios.  Figure S3: PANDORA sensitivity values a cross the 4-changing conditions scenarios. In the plot, the median is represented by the line whereas the mean is represented by the diamond. Wilcoxon rank-sum test was applied across scenarios Supplementary Figure S4: T-test applied to kraken's results. The "Low prevalence Bad Quality" scenario was not included to the comparison because it cannot be compared with the rest of the high prevalence scenarios (i.e., sequence's number is tenfold smaller). Bottom left: a stack of reads aligning to the same genome region is shown (Integrative Genomics Viewer snapshot), along 6 different BAM files, which would be an example of poor coverage and not reliable microbe. Bottom right: IGV snapshot showing 5 BAM files with reads aligning to the same genome region, but in this case the reads are scattered and widespread, which is a proof of reliability following the assumptions of the method.
Supplementary Figure S10: DRAC penalization scheme showing an example of a counts table (left: samples in columns, bacteria in rows) and a coverages table calculated by bedtools coverageBed (right: samples in columns, bacteria in rows). The blue matrix shows the Expected coverage, which results from multiplying the counts table by a scalar, the average read length (e.g., 100 bp). The Observed / Expected table is computed by the element-wise division of two matrices, the Observed coverageBed table, by the Expected values. This new matrix elements have a range of 0 to 1, being values close to zero, genomes with poor coverage (non-trustworthy) and values close to one genome with high coverage (reliable microbes). The software needs a user-provided threshold to decide the genomes that will be penalized down to zero. In this example and for most applications, a value of 0.4 or higher, is recommended, being threshold values close to 1 the most restrictive ones, while on the contrary, a 0 value for the threshold will not penalize at all. This threshold will be used as the cutoff value to create a new binary matrix, with only 0 and 1 values, being the 0 elements, genomes that will be dismissed, and 1, genomes that will be trusted. The binary matrix will multiply the observed counts table to yield the final count table for downstream statistical analyses.